Abstract

Comparative  tests  are  commonly  used  during  the  operational  testing  phase  to 
baseline  the  system  under  test  (SUT)  against  the  current  status  quo.  Depending  on 
the  type  of  SUT,  the  comparative  test  may  be  costly  and  resource  intensive.  Thus, 
any  insights  which  may  be  gleaned  about  the  potential  results  of  the  test  beforehand 
may provide guidance on (1) the potential benefits of conducting the test and (2) the
structuring  of  the  test.  This  paper  offers  a  statistical  approach  to  understanding  the 
type  of  results  which  may  emerge  during  comparative  testing  of  the  SUT. 

Specifically,  we  utilize  the  concept  of  statistical  inference  to  determine  the  needed 
performance difference between the SUT and the baseline system. If performance
differences are statistically significant, there may be useful information to be gained
from conducting the test as is. Performance differences that are not statistically
significant may indicate that the test should be restructured or postponed. In either
case,  the  relevant  decision-maker  is  provided  with  information  about  the  potential 
results  of  the  test  beforehand  in  order  to  make  an  informed  decision.  We  illustrate 
the  method  of  statistical  inference  on  a  system  which  improves  situational  awareness 
on  the  battlefield.  We  define  a  number  of  comparative  metrics  used  to  evaluate  the 
operational  effectiveness  of  the  baseline  system  and  the  SUT.  From  the  notional 
situational  awareness  system  presented  in  this  paper,  we  demonstrate  the  insights 
which  may  be  gleaned  and  the  implications  for  operational  testing  using  statistical 
inference. 
Introduction

Comparative  tests  are  used  during  the  operational  testing  phase  to  baseline  the 
system  under  test  (SUT)  against  the  current  status  quo.  Depending  on  the  type  of  SUT  and 
test  complexity,  the  comparative  test  may  be  costly  to  administer  and  challenging  to  repeat. 
Thus,  any  insights  which  may  be  gleaned  about  the  potential  results  of  the  test  beforehand 
may provide guidance on (1) the potential benefits of conducting the test and (2) the
structuring  of  the  test.  From  the  relevant  decision-maker’s  perspective  (e.g.,  Office  of  the 
Under  Secretary  of  Defense  for  Acquisition,  Technology  and  Logistics  [OUSD(AT&L)], 
Department  of  Defense  [DoD]),  knowledge  about  the  potential  outcome  of  the  comparative 
test  and  implications  for  test  structuring  may  lead  to  a  more  cost-effective  test  execution, 
providing  maximal  information  about  the  SUT  performance  under  operational  conditions 
given  resources  expended. 

This  paper  offers  an  applied  statistical  approach  to  understanding  the  type  of  results 
which  may  emerge  during  comparative  testing  of  the  SUT  a  priori.  Statistical  analysis  is 
commonly  used  in  the  physical  and  social  sciences  to  understand,  quantify,  and  evaluate 
differences  between  treatment  groups  and  control  groups  (Wooldridge,  2003).  The  statistical 
analysis  employed  to  evaluate  differences  may  range  from  a  numerical  or  graphical 
description  of  observed  differences  using  descriptive  statistics  to  a  more  complex  analysis  in 
understanding  the  implications  of  pattern  differences  while  accounting  for  randomness  using 
inferential  statistics.  The  specific  statistical  approach  employed  depends  on  the  type  of 
scientific  inquiry  being  conducted  and  the  data  available.  For  this  paper,  we  were  concerned 
with  understanding  the  performance  difference  of  the  SUT  relative  to  the  baseline  system 
during  the  comparative  test.  In  particular,  we  utilized  the  concept  of  statistical  inference  to 
determine  the  needed  performance  difference  between  the  SUT  and  the  baseline  system  for 
statistical  significance.  Next,  we  highlighted  the  implications  of  this  analysis  for  test 
structuring. If performance differences are statistically significant, useful information may be
gained from conducting the test as is. Performance differences that are not statistically
significant may indicate that the test should be restructured or postponed. In either case, the
relevant  decision-maker  is  provided  with  information  about  the  potential  results  of  the  test 
beforehand  in  order  to  make  an  informed  decision. 

To  present  the  utilization  of  statistical  inference  in  understanding  the  SUT 
performance  a  priori,  this  paper  is  divided  into  the  following  sections.  The  section  titled 
Statistical  Inference  in  Operational  Testing  and  Evaluation  presents  an  overview  of  the 
acquisition  process  of  weapons  systems  and  discusses  the  use  of  statistical  inference  in 
operational testing and evaluation. In the section titled Application of Statistical Inference in
Guiding Operational Test Expectations, we illustrate the method of statistical inference on
a system that improves situational awareness on the battlefield. We define a number of
comparative metrics used to evaluate the operational effectiveness of the baseline system
and the SUT. In the section titled Potential Outcomes and Analysis, we delve further
into the analysis of the potential outcome of the comparative test. From the situational
awareness system presented in this paper, we demonstrate the insights that may be
gleaned and the implications for operational testing using statistical inference. A few
assumptions  were  made  in  evaluating  the  potential  outcome  of  the  comparative  test.  The 
section  titled  Sensitivity  Analysis  tests  the  robustness  of  the  derived  conclusions  in  the 
Potential  Outcomes  and  Analysis  section  to  changes  in  these  assumptions.  The  section 
titled Conclusion completes this study. Although information about the actual system in this
study has been masked, the data and analysis are representative of the actual system, and
the implications and conclusions of this paper remain consistent with those derived from
the original study.


ACQUISITION  RESEARCH:  CREATING  SYNERGY  FOR  INFORMED  CHANGE  -  346 


Statistical  Inference  in  Operational  Testing  and  Evaluation 

The  acquisition  of  a  weapons  system  is  traditionally  divided  into  five  phases  with 
each  phase  having  the  requisite  milestone.  An  acquisition  program  is  required  to  meet  the 
specific  statutory  and  regulatory  requirements  dictated  by  the  milestone  to  proceed  to  the 
next  phase.  The  Milestone  Decision  Authority  (MDA)  holds  the  responsibility  for  determining 
whether the requirements of the milestone have been met and the weapons program may
proceed to the next phase. The phases of the acquisition process are shown in Figure 1.


[Figure: acquisition phase diagram. Inputs shown: User Needs; Technology Opportunities & Resources. Notes in the figure:]

•  The Materiel Development Decision precedes entry into any phase of the acquisition management system

•  Entrance criteria met before entering phase

•  Evolutionary Acquisition or Single Step to Full Capability

Figure 1. Overview of Acquisition of Weapon Systems

Note.  Adapted  from  Department  of  Defense  Instruction  5000.02  (USD[AT&L],  2008). 

The  first  phase,  the  Materiel  Solution  Analysis,  provides  a  preliminary  analysis  of  the 
weapon  systems.  Within  this  phase,  analysts  first  assess  the  user  needs.  If  a  need  is  shown 
to  be  evident,  the  analysts  conduct  an  analysis  of  alternatives  to  evaluate  probable  options 
to  fulfilling  the  need.  The  second  phase,  Technology  Development,  involves  a  determination 
of  the  technologies  needed  to  operationalize  the  weapon  system  as  well  as  the  development 
and  testing  of  the  technology.  Once  the  technology  is  shown  to  be  functional  in  a  relevant, 
or  in  the  preferred  case,  an  operational  environment,  the  program  may  proceed  to  the  third 
phase of the acquisition process. Within the third phase, Engineering and Manufacturing
Development, the various sub-systems are developed, tested, and fully integrated into a
complete  weapon  system.  Also  within  this  phase,  a  system  demonstration  occurs  to  show 
the  military  utility  of  the  system  as  well  as  the  manufacturing  resources  and  processes 
required  (Schwartz,  2010).  The  fourth  stage  is  the  Production  and  Deployment  phase.  In  this 
phase,  the  weapon  system  enters  Low  Rate  Initial  Production  (LRIP)  and  an  Initial 
Operational  Testing  and  Evaluation  (IOT&E)  occurs  to  determine  the  battle  worthiness  of  the 
system.  Congress  requires  testing  of  major  systems  and  weapons  programs  to  be 
conducted  under  operationally  realistic  conditions  to  determine  the  operational  suitability  of 
the  system  and  whether  it  should  proceed  beyond  LRIP  (Fox,  Boito,  Graser,  &  Younossi, 
2004). The final phase, Operations and Support, involves a commitment to the full-rate
production and operation of the system. The system is fielded in a real-time operational
environment, and maintenance support (among other types of support) is provided by the
relevant contractor(s).

During  IOT&E,  comparative  evaluations  are  sometimes  conducted.  These  tests  are 
side-by-side  comparisons  in  which  the  performance  of  a  battalion  with  the  SUT  and  without 
the  SUT  will  be  examined  through  a  series  of  tactical  battles.  The  intent  of  the  test  is  to 
determine  whether  (and  by  how  much)  the  unit’s  performance  improves  with  the  SUT.  One 
method  for  assessing  whether  an  improvement  has  occurred  is  through  the  use  of  statistical 
inference. This technique, in particular significance testing, is a well-established method for
determining  whether  the  outcome  of  a  treatment  scenario  differs  significantly  from  a 
controlled  scenario.  Generally,  statistical  inference  is  used  in  the  analysis  of  the  outcome  of 
operational  tests  and  has  been  noted  as  a  best  practice  in  system  evaluation  (Commission 
on  Behavioral  and  Social  Sciences  and  Education  [CBASSE],  1998).  While  it  is  not 
suggested  that  statistical  inference  is  the  sole  evaluation  tool,  statistical  inference 
possesses a number of advantages, primary among which is its objectivity given the various
incentives  and  motivations  of  the  stakeholders  in  the  acquisition  process.  In  addition, 
through  statistical  inference,  it  is  possible  to  gain  insight  beforehand  on  the  outcome  of 
comparative  tests. 

In  prior  comparative  tests,  particularly  for  ground  combat  systems,  there  has  been 
mixed  success  in  establishing  improved  effectiveness  of  new  systems  using  operational 
performance  metrics.  In  most  of  these  tests,  a  major  contributor  to  the  difficulty  in  finding  a 
statistically  significant  “difference”  between  the  performances  of  the  unit  with  the  SUT  and 
without  the  SUT  has  been  the  sizable  magnitude  of  the  variability  within  the  data  for  the 
metric being considered and the small sample size (CBASSE, 1995). That is, the standard
deviation within the data has been so large that finding a difference between the means or
other measures of central tendency requires a very large difference in the means of the
two data sets, generally larger than can be reasonably expected in combat.
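To make the role of variability and sample size concrete, the sketch below computes the smallest difference in means that a one-sided two-sample z-test would flag as significant. It is an illustration only: the function name, the normal approximation, the assumption of equal standard deviations and equal sample sizes in both groups, and the example numbers are all our own assumptions and are not taken from the tests cited above.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_difference(sigma, n, alpha=0.10):
    """Smallest mean difference significant in a one-sided two-sample z-test.

    Assumes (hypothetically) an equal standard deviation `sigma` and an equal
    sample size `n` in both groups, and a normal approximation.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # about 1.28 for alpha = 0.10
    standard_error = sigma * sqrt(2.0 / n)      # SE of the difference in means
    return z_crit * standard_error


# Illustrative numbers only: doubling sigma doubles the required difference,
# while quadrupling n halves it.
small = min_detectable_difference(sigma=0.05, n=13)
large = min_detectable_difference(sigma=0.10, n=13)
more_trials = min_detectable_difference(sigma=0.10, n=52)
print(round(small, 4), round(large, 4), round(more_trials, 4))
```

Under these assumptions the required difference grows in proportion to the standard deviation and shrinks only with the square root of the number of trials, which is why a small number of highly variable battles rarely yields a significant result.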

The  phenomenon  is  illustrated  in  Figure  2.  Ideally,  we  expect  to  see  what  is  shown 
on  the  left.  The  ideal  data  would  show  small  performance  variability  by  the  SUT  and  the 
baseline  system  accompanied  by  a  significant  mean  performance  difference  between  the 
SUT  and  the  baseline  system.  What  frequently  happens  in  ground  combat  tests  is  depicted 
on  the  right.  The  actual  data  commonly  reveals  large  performance  variability  for  both  SUT 
and  baseline  systems  as  well  as  marginal  differences  in  their  mean  performance. 


[Figure: two panels labeled IDEAL (left) and ACTUAL (right).]

Figure  2.  Illustration  of  Ideal  Versus  Actual  Variability  in  Metrics 

In  the  end,  the  data  generally  show  that  any  “apparent”  difference  between  the 
performance  of  the  unit  with  and  without  the  new  capability  has  proven  to  be  not  statistically 
“significant.”  Upon  finding  “no  difference”  in  operational  performance  metrics,  alibis  are 
offered  for  the  technical  results  (e.g.,  unit  was  not  trained,  poor  scenarios),  and  the 
assessments resort to subjective measures (e.g., interview comments, commanders'
impressions) to support the case for buying the new system.

To avoid such situations, we propose examining the possible
outcomes of the comparative tests beforehand through the application of statistical
inference. The outcome of an a priori analysis using statistical inference may point to areas
where the test may be modified or additional control measures may be introduced to
increase the likelihood of obtaining the desired results, or it may highlight scenarios in which
utilization of the SUT may not be appropriate. In the next section, we demonstrate the use of
statistical inference on a system designed to improve situational awareness and describe
how  the  results  may  be  used  to  guide  expectations  on  the  outcome  of  the  comparative  tests. 

Application  of  Statistical  Inference  in  Guiding  Operational  Test  Expectations 

One  of  the  key  insights  emerging  from  Operation  Joint  Endeavor  and  Operation 
Desert  Storm  was  the  need  for  a  mobile  infantry  that  would  rapidly  deploy  ahead  of  the 
armed  forces.  To  support  this  rapid  deployment,  a  number  of  operational  needs  statements 
(ONS) from theater called for ground and aerial robotic capability to enable better situational
awareness and understanding. These systems would improve intelligence, surveillance and
reconnaissance  at  the  lower  unit  level  through  advanced  technological  capabilities  and  allow 
speedy  intelligence  dissemination  through  enhanced  networking  capabilities.  Over  the  last 
two  decades,  a  number  of  systems  have  been  or  are  being  developed  to  address  the  issue 
of  situational  awareness  and  understanding.  Examples  of  these  include  the  Battlefield 
Combat  Identification  System  (BCIS),  a  secure  question  and  answer  system  that  was 
intended  to  perform  active  identification  of  friendly  targets  to  minimize  fratricide  on  the 
battlefield,  and  the  Early  Infantry  Brigade  Combat  Team  systems  (nee  Future  Combat 
System),  which  were  intended  to  rapidly  and  securely  disseminate  information,  thereby 
providing a technological advantage over the enemy on the battlefield (DoDIG, 2001; U.S.
Army,  2011). 

In this paper, we explored the potential benefits of a system under test (SUT) that
improves situational awareness on the battlefield by allowing soldiers to detect and
identify  threats  (persons  or  otherwise)  from  a  secure  distance  and  in  a  reasonable 
timeframe.  We  comparatively  examined  the  effect  of  the  SUT  on  unit  mission  success, 
casualties,  and  fratricides  relative  to  those  units  that  do  not  possess  the  SUT  based  on  data 
collected  from  a  previous  Limited  User  Test  2009  (LUT  09).  Specifically,  based  on  the 
means  and  standard  deviations  of  the  selected  metrics  collected  in  the  LUT  09,  we 
evaluated  whether  the  behavior  of  these  means  and  standard  deviations  can  reasonably  be 
expected  to  generate  differences  that  are  statistically  significant  during  any  subsequent 
Initial  Operational  Test  &  Evaluation  (IOT&E)  event. 

Evaluation  Metrics 

A  listing  of  the  proposed  evaluation  paradigm  for  the  IOT&E  comparison  was 
developed  and  approved  by  testing  offices  within  the  Department  of  Defense.  From  this 
listing,  we  examined  select  metrics  for  which  data  are  available  from  the  earlier  LUT  09. 

First,  we  computed  the  means  (or  other  measures  of  central  tendencies)  and  the  standard 
deviations.  Next,  assuming  those  values  represented  the  “treatment”  situation  (i.e.,  as  the 
SUT  was  used  in  the  LUT  09,  the  “treatment”  situation  is  the  unit  performance  with  the 
SUT),  we  examined  how  different  the  performance  would  have  to  be  in  the  baseline  for 
there  to  be  a  statistical  difference  exhibited  in  the  data  we  have.  In  all  cases,  we  assumed 
that  the  standard  deviation  exhibited  in  the  baseline  case  is  the  same  as  that  in  the 
treatment  situation  as  no  variability  data  exists  for  the  baseline.  However,  this  assumption  is 
tested later in the Sensitivity Analysis section. (Note: As the
unit claimed that the systems did not help them, we also reexamined the data assuming the
values obtained in the test represent the “baseline” and observed how much improvement
the new systems must provide in order to be statistically different. In most cases, the magnitude of the
difference is all that matters; thus, whether the data we have is baseline or treatment is a
moot point except in select cases where the parameters are not symmetric.) The measures
we examined were as follows:
■  The number of times BLUFOR accomplished the assigned mission. (Note: We compared the number of battles, using a “sign test” for paired battles, and also the percentage of total battles that the BLUFOR accomplished.)

■  The number of BLUFOR and OPFOR casualties. (Specifically, we looked at percentages.)

■  The number of fratricide incidents. (Specifically, we looked at percentages.)

These  three  measures  assess  the  top  level  performance  of  a  unit  relative  to  another 
unit.  It  is  understood  that  the  actual  comparison  during  any  subsequent  IOT&E  will  look  at 
numerous  other  measures,  including  subjective  ones.  For  example,  structured  interviews 
may be considered complementary to the statistical analysis and may be performed during
the  comparative  test  to  aid  in  explaining  why  statistical  differences  did  or  did  not  occur 
during  the  test.  However,  for  many  of  these  measures,  no  data  were  collected  in  LUT  09, 
and  we  felt  that  the  selected  measures  would  give  a  reasonable  indication  of  what  to  expect. 

The  key  discriminators  between  the  SUT  battalion  and  the  baseline  battalion  would 
be  (1)  the  degree  of  improved  situational  awareness  provided  to  the  unit  and  (2)  the  impact 
of  having  this  improved  situational  awareness.  The  expectation  is  that  the  metrics  (or 
measures  of  merit)  will  show  improved  situational  awareness  attributable  to  the  presence  of 
the  SUT,  and  no  loss  of  lethality  or  force  protection  when  compared  to  the  baseline.  The 
metrics  are  applicable  to  both  the  SUT  and  the  baseline  battalion  and  serve  as  the  basis  for 
comparison  between  them. 

Definition  of  Metrics 

Mission  Success 

Mission  success  is  a  complex  measure  driven  by  a  number  of  factors,  among  which 
are  the  number  of  BLUFOR  kills,  the  number  of  civilian  kills,  whether  the  unit  achieved  its 
objective,  etc.  For  our  analysis,  we  relied  on  expert  determination  by  subject-matter  experts 
(SMEs)  on  site  during  the  test  for  identifying  whether  a  mission  was  accomplished.  We  used 
two  metrics  to  assess  mission  success.  The  first  metric  was  the  number  of  times  the 
BLUFOR  unit  accomplished  its  missions.  Using  this  metric,  we  performed  a  sign  test  to 
compare  the  baseline  unit  and  the  SUT  unit.  The  second  metric  was  the  mission  success 
rate.  This  metric  normalizes  the  number  of  accomplished  missions  by  the  number  of 
missions  conducted.  It  is  a  bit  more  informative  than  simply  the  number  of  accomplished 
missions  as  it  indicates  the  past  success  rate  of  the  BLUFOR  unit  in  accomplishing  its 
missions. The mission success rate is calculated as follows:

\[
\text{Mission Success Rate} = \frac{\text{Number of Missions Accomplished}}{\text{Number of Missions Conducted}} \tag{1}
\]

For this metric, we used a two-proportion z-test in the comparative analysis.

Casualties 

In  this  study,  we  defined  casualties  as  the  number  of  kills  a  unit  sustains.  Initially,  we 
considered  two  metrics  to  assess  casualties  sustained  by  the  units.  These  were  the  number 
of  losses  and  the  casualty  rate.  The  number  of  losses  is  the  total  number  of  casualties  a  unit 
incurs  over  the  mission.  While  this  metric  gives  a  first  order  glimpse  of  the  force  protection 
capability  of  the  unit,  it  does  not  account  for  the  cost  of  these  casualties  to  the  unit.  For 
example,  a  casualty  loss  of  20  soldiers  is  more  costly  to  a  unit  that  has  a  starting  strength  of 
30  than  it  is  to  a  unit  that  has  a  starting  strength  of  130.  For  this  reason,  we  decided  to  look 
at  the  casualty  rate.  The  casualty  rate  incorporates  information  about  the  cost  of  casualties 
to the unit by normalizing the number of casualties a unit sustains by the unit starting
strength. The formula for the casualty rate is shown below.

\[
\text{Unit Casualty Rate} = \frac{\text{Number of Losses Sustained by Unit}}{\text{Starting Strength of Unit}} \tag{2}
\]

Using the previous example, a unit with a starting strength of 30 which has 20
casualties will have a high casualty rate of 0.67, while a unit with a starting strength of 130
will have a low casualty rate of 0.15. In this analysis we used the casualty rate and the
Student's t-test to draw conclusions on what to expect in the IOT&E.

Fratricides 

A number of definitions exist for fratricides. Among these are (1) any engagement in
which a friend fires at a friend, whether damage is done or not, and (2) casualties caused by
friendly fire. For the purpose of this analysis we used the second definition, casualties
caused by friendly fire, as this definition more accurately reflects the damage caused. It is
important to note, however, that the first definition is as relevant as the second
definition in assessing how well the soldier is able to distinguish a threat from a friendly. In a
similar  vein  to  casualties,  we  considered  two  metrics,  the  number  of  fratricides  and  the 
fratricide  rate.  This  rate  is  the  number  of  unit  fratricides  as  a  percentage  of  the  total  unit 
casualties. 


\[
\text{Unit Fratricide Rate} = \frac{\text{Number of Unit Fratricides}}{\text{Number of Unit Casualties}} \tag{3}
\]


For reasons similar to those discussed previously, we selected the fratricide rate as the
comparison metric. This metric is insightful as it indicates the likelihood that a soldier is killed
by another soldier in the same unit. Alternatively, this metric may be viewed as a measure of
the self-inflicted casualties in a unit. Using a Student's t-test, we sought to determine whether
it  is  possible  to  draw  statistical  conclusions  about  the  ability  of  the  SUT  to  reduce  fratricides 
relative  to  the  baseline  systems  through  improved  situational  awareness. 


Potential  Outcomes  and  Analysis 


Mission  Success 


There  were  13  missions  in  the  SUT  LUT  09:  Three  were  attack  missions,  two  were 
defend  missions,  three  were  cordon  and  search  missions,  four  were  raid  missions,  and  the 
remaining  one  was  a  stability  ops  mission.  The  BLUFOR  had  an  85%  success  rate, 
accomplishing  11  of  the  13  missions,  partially  accomplishing  one,  and  failing  to  accomplish 
one. A summary of these mission success statistics is shown in Table 1.


Table  1.  Description  of  Mission  Outcomes 


Mission  Mission Type          Successful  BLUFOR    BLUFOR      OPFOR     OPFOR
                                           Starting  Casualties  Starting  Casualties
                                           Strength              Strength
1        Raid                  yes         130       10          50        26
2        Raid                  yes         130       7           50        25
3        Defend                yes         130       25          50        0
4        Attack                yes         130       15          50        10
5        Attack                yes         130       25          50        8
6        Cordon and Search     yes         130       8           50        7
7        Defend                yes         130       16          50        15
8        Cordon and Search     yes         130       12          50        6
9        Raid                  partially   130       7           50        3
10       Cordon and Search     yes         130       20          50        8
11       Attack                no          130       14          50        10
12       Stability Operations  yes         130       2           50        5
13       Raid                  yes         130       10          50        22

An  initial  glance  at  the  high  mission  success  rate  might  imply  that  the  SUT 
contributed  positively  to  situational  awareness  and  thus  operational  performance.  However, 
caution  is  advised  against  prematurely  drawing  this  conclusion  from  the  data.  Table  1 
indicates  that  on  average  the  BLUFOR  starting  strength  was  about  two  to  three  times  that  of 
the  OPFOR.  This  difference  in  starting  strength  may  give  the  BLUFOR  a  significant 
advantage.  Without  properly  accounting  for  this  advantage,  it  is  possible  to  incorrectly 
conclude  that  the  performance  of  the  BLUFOR  is  attributed  to  the  SUT. 
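The summary figures discussed above can be recomputed directly from Table 1. The sketch below is a minimal illustration: the variable names are our own, the 85% figure is reproduced by counting only fully accomplished missions (an assumption consistent with the 11-of-13 count), and the mean casualty rate is simply the casualty-rate definition applied to the Table 1 rows, which may differ from the inputs used in the paper's later casualty analysis.

```python
# BLUFOR outcomes and casualties per mission, transcribed from Table 1.
outcomes = ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes",
            "partially", "yes", "no", "yes", "yes"]
blufor_casualties = [10, 7, 25, 15, 25, 8, 16, 12, 7, 20, 14, 2, 10]
BLUFOR_START, OPFOR_START = 130, 50  # starting strengths, constant in Table 1

accomplished = outcomes.count("yes")              # fully accomplished missions
success_rate = accomplished / len(outcomes)       # accomplished / conducted
force_ratio = BLUFOR_START / OPFOR_START          # BLUFOR-to-OPFOR strength
casualty_rates = [c / BLUFOR_START for c in blufor_casualties]
mean_casualty_rate = sum(casualty_rates) / len(casualty_rates)

print(f"missions accomplished: {accomplished} of {len(outcomes)}")
print(f"success rate: {success_rate:.2f}")
print(f"force ratio:  {force_ratio:.1f}")
print(f"mean BLUFOR casualty rate: {mean_casualty_rate:.3f}")
```

A force ratio well above one in every mission is the confounding factor the paragraph warns about: the success rate alone cannot separate the effect of the SUT from the effect of numerical superiority.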

Missions  Accomplished 

To evaluate whether the high mission success rate may
be attributed to the SUT, we conducted statistical analyses on the LUT 09 data. In particular, we
examined how many missions the baseline unit would have to lose, given the 85% mission
success rate (or 11 missions accomplished) of the SUT unit, for the difference to be statistically
significant. The first metric examined was the number of missions accomplished. For the
analysis of missions accomplished, we used the binomial sign test (Sheskin, 2004).

The  binomial  sign  test  compares  differences  in  the  performance  of  baseline  systems 
relative to the SUT system using paired tests. The test statistic only considers the mission
outcomes that differ between the system under test and the baseline systems, as these
differing outcomes act as discriminators between the two tests. These possible outcomes
are  shown  in  Table  2.  Only  12  missions  were  evaluated,  as  the  partially  successful  mission 
was  eliminated  from  the  dataset.  The  variables  in  the  table  were  defined  as  follows: 

X_WW - number of missions accomplished by both the baseline and the SUT units

X_WL - number of missions accomplished by the SUT unit, but not the baseline unit

X_LW - number of missions accomplished by the baseline unit, but not the SUT unit

X_LL - number of missions accomplished by neither the baseline nor the SUT unit


Table  2.  Notional  Representation  of  Mission  Outcomes 

                          Baseline
                 W              L              Total
SUT    W         X_WW           X_WL           11
       L         X_LW           X_LL           1
       Total     X_WW + X_LW    X_WL + X_LL    12

For the binomial sign test, the number of trials was defined as X_LW + X_WL, or the total
number of missions in which the outcome differs between the two groups. The number of
successes was defined as X_WL, or the number of times the SUT unit outperforms the
baseline  unit.  The  null  hypothesis  assumed  that  there  is  no  difference  between  the  mission 
performance  of  the  SUT  unit  and  the  baseline  unit.  Therefore,  a  success  and  a  non-success 
are  equally  likely  to  occur,  leading  to  the  null  hypothesis  being  defined  as: 

H0:  p  =  0.5  (4) 

where  p  is  the  likelihood  of  success.  As  the  objective  of  the  comparative  test  during 
any  subsequent  IOT&E  is  to  determine  whether  the  SUT  positively  contributes  to  situational 
awareness,  our  alternative  hypothesis  was  one  directional  and  given  by: 

Ha:  p  >  0.5  (5) 

The  test  was  performed  with  a  90%  confidence  level.  Next,  we  used  the  cumulative 
binomial  probability  distribution  function  to  determine  the  likelihood  that  the  SUT  unit 
outperforms  the  baseline  unit  a  certain  number  of  times  or  greater  (e.g.,  eight  or  more 
times). If this probability was less than α = 0.1, we rejected the null hypothesis and
concluded  that  there  is  statistical  evidence  the  SUT  unit  outperforms  the  baseline  unit  and 
enhances  situational  awareness.  For  this  study,  we  were  concerned  with  the  number  of 
losses  by  the  baseline  unit  for  statistical  significance.  Table  3  shows  the  results  of  the 
analysis  where  the  probability  columns  indicate  the  likelihood  that  the  SUT  unit  outperforms 
the  baseline  unit  by  at  least  a  certain  number  of  missions  (e.g.,  eight  or  more),  and  the 
required  baseline  losses  columns  indicate  the  actual  losses  required  to  observe  the  SUT 
unit  outperform  the  baseline  unit  by  a  certain  number  of  missions. 
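Stated as a formula, the probability described above is the upper tail of a binomial distribution under the null hypothesis. This is our reading of the procedure rather than a formula quoted from the source, and the symbol S for the number of successes is introduced here for illustration only.

```latex
\Pr(S \ge X_{WL}) \;=\; \sum_{i = X_{WL}}^{n} \binom{n}{i} \left(\tfrac{1}{2}\right)^{n},
\qquad n = X_{WL} + X_{LW}.

% Worked example: X_{LW} = 1 and X_{WL} = 6, so n = 7.
\Pr(S \ge 6) \;=\; \frac{\binom{7}{6} + \binom{7}{7}}{2^{7}} \;=\; \frac{8}{128} \;=\; 0.0625 \;<\; \alpha = 0.1 .
```

In the worked case the tail probability falls below α, so the null hypothesis of equal performance would be rejected.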
Table 3. Required Baseline Losses

        X_LW = 1                                  X_LW = 0
X_WL    Pr(Number of        Required Baseline     Pr(Number of        Required Baseline
        Successes ≥ X_WL)   Losses                Successes ≥ X_WL)   Losses
0       1.000               0                     1.000               1
1       0.750               1                     0.500               2
2       0.500               2                     0.250               3
3       0.313               3                     0.125               4
4       0.188               4                     0.063               5
5       0.109               5                     0.031               6
6       0.062               6                     0.016               7
7       0.035               7                     0.008               8
8       0.020               8                     0.004               9
9       0.011               9                     0.002               10
10      0.006               10                    0.001               11
11      0.003               11                    0.000               12

There are a couple of facts to note about this table. First, as the SUT unit failed to
accomplish only one mission, there are only two possible values for $X_{LW}$. This greatly
simplified our analysis. However, a single loss meant that there were 12 possible values for
$X_{WL}$, all of which are laid out in the table. As expected, the probability of achieving a certain
number of successes or more decreased as $X_{WL}$ increased. For example, assuming the
baseline unit outperforms the SUT unit in one mission (i.e., $X_{LW} = 1$), the probability of the
SUT unit outperforming the baseline unit six or more times is 0.062, while for 11 or more times
it is 0.003. The table also indicates that the required minimum number of losses by the
baseline unit for statistical significance is six, or around half of the 12 missions. Five or
fewer losses will lead to statistically inconclusive results. In the case where the baseline never
outperforms the SUT unit (i.e., $X_{LW} = 0$), the minimum number of losses needed to show a
statistically significant difference between the two units is five. Four or fewer losses will
lead to statistically inconclusive results.
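The probabilities in Table 3 can be checked with a short script. The sketch below assumes the sign test treats only the discordant missions (those where exactly one unit succeeds) as trials with a success probability of 0.5 under the null hypothesis; the function names and the `sut_losses` parameter are illustrative and do not come from the original analysis.

```python
from math import comb


def sign_test_p_value(x_wl: int, x_lw: int) -> float:
    """P(at least x_wl SUT wins among n = x_wl + x_lw discordant missions), p = 0.5."""
    n = x_wl + x_lw
    return sum(comb(n, k) for k in range(x_wl, n + 1)) / 2 ** n


def min_baseline_losses(x_lw: int, sut_losses: int = 1, alpha: float = 0.1,
                        max_x_wl: int = 11):
    """Smallest number of baseline losses whose sign-test probability is below alpha."""
    for x_wl in range(max_x_wl + 1):
        if sign_test_p_value(x_wl, x_lw) < alpha:
            # An SUT loss not matched by a baseline win is a mission both units lost,
            # so it adds one baseline loss on top of the x_wl discordant ones.
            return x_wl + (sut_losses - x_lw)
    return None  # no value in the tabulated range reaches significance


for x_lw in (1, 0):
    print(f"X_LW = {x_lw}: minimum baseline losses = {min_baseline_losses(x_lw)}")
```

Under these assumptions the script returns six losses for $X_{LW} = 1$ and five for $X_{LW} = 0$, in line with the thresholds read from the table.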

In  summary,  based  on  the  LUT  09  results  in  which  the  BLUFOR  won  11  of  the  13 
battles,  a  baseline  unit  would  have  to  underperform  the  SUT  unit  by  five  or  more  missions  to 
be  considered  statistically  different  from  the  observed  outcome  of  the  SUT  unit. 

Mission  Success  Rate 

The second metric of mission success considered was the mission success rate. We
used a one-tailed two-proportion z-test to determine how far baseline unit performance
would have to fall to produce a statistically significant difference. The following formula was
applied (Ott & Longnecker, 2010):
$$z = \frac{p_1 - p_2}{\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}} \qquad (6)$$


where $p_1$ is the success rate of the SUT unit, $p_2$ is the success rate of the baseline unit,
and $n_1$ and $n_2$ are the numbers of missions conducted by the SUT unit and the baseline unit,
respectively. Using a z-value of 1.28 and a mean SUT mission success rate of 0.85 (or
85%), we solved for the baseline mean mission success rate given the number of baseline
missions conducted. Figure 3 shows the mission success rate and the number of missions
conducted. The mean mission success rate depicted in the figure is the maximum rate the
baseline unit can achieve. Beyond this maximum rate, the results of the comparative
evaluation become statistically inconclusive. The red dot is the mission success rate of the
SUT unit observed in LUT 09.
Figure 3. Maximum Required Mean Baseline Mission Success Rate (horizontal axis: Number of Baseline Missions)

The figure shows that for a large number of baseline missions conducted, for
example 20 missions, the required mean mission success rate for statistical difference is
approximately 66%. As the number of missions decreases to 13, the number conducted in
LUT 09, the required mission success rate is 63%. Further decreases in the number of
missions conducted result in success rates below 60% (i.e., there is not a great deal of
confidence that the baseline unit will accomplish the mission).
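The values quoted for 13 and 20 baseline missions can be approximated by solving the z statistic above for the baseline rate. This is a minimal sketch, not the authors' procedure: it assumes the SUT rate is 11/13 (which the text rounds to 0.85), that the SUT unit conducted 13 missions, and that the root is found by simple bisection; all function names are illustrative.

```python
from math import ceil, sqrt


def z_statistic(p1: float, n1: int, p2: float, n2: int) -> float:
    """Two-proportion z statistic with unpooled variance."""
    return (p1 - p2) / sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def max_baseline_rate(p1: float, n1: int, n2: int, z_crit: float = 1.28) -> float:
    """Largest baseline success rate p2 for which z still reaches z_crit (bisection)."""
    lo, hi = 0.0, p1  # z is large at p2 = 0 and exactly 0 at p2 = p1
    for _ in range(60):
        mid = (lo + hi) / 2
        if z_statistic(p1, n1, mid, n2) >= z_crit:
            lo = mid  # still significant, so the baseline rate can be higher
        else:
            hi = mid
    return lo


P_SUT, N_SUT = 11 / 13, 13  # assumption: 11 of 13 missions accomplished (about 0.85)

for n2 in (13, 20):
    p2 = max_baseline_rate(P_SUT, N_SUT, n2)
    losses = ceil(n2 * (1 - p2))  # whole missions the baseline unit would have to lose
    print(f"n2 = {n2}: maximum baseline success rate = {p2:.3f}, losses needed = {losses}")
```

With these inputs the sketch gives roughly 63% for 13 missions and 66% for 20 missions. Converting the 13-mission rate into whole missions gives five baseline losses.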

Figure 4 breaks down the mission success rate by showing the number of
required mission losses given the number of missions conducted. If 13 missions are
conducted, the required number of baseline losses for statistical significance is five. It is
interesting to note that this closely mirrors the results obtained from the binomial sign
test. That is, whether considering the sign test (number of missions accomplished) or the
mission success rate, for there to be a significant difference between a baseline and an
SUT unit, the baseline unit needs to lose four to five more missions than the SUT unit based
on the LUT 09 data.
Figure 4. Minimum Number of Baseline Mission Losses


For  both  metrics  of  mission  success  considered,  the  statistics  indicated  that  the 
baseline  unit  will  need  to  perform  very  poorly  to  produce  conclusive  results.  However,  as 
currently  constructed,  the  overwhelming  starting  strength  of  the  BLUFOR  argues  against 
such an outcome. The ratio of BLUFOR starting strength to OPFOR starting strength is on
average greater than 2:1. This provides the BLUFOR with a significant advantage, which
may  be  leveraged  to  overwhelm  the  OPFOR  during  various  operations  with  or  without  the 
SUT.  Given  the  current  test  set-up  of  a  substantial  BLUFOR  manpower  advantage  relative 
to  the  OPFOR  manpower,  it  is  unlikely  that  these  metrics  will  produce  any  conclusive  results 
about  the  contribution  of  the  SUT  to  mission  success  by  enhancing  situational  awareness 
and  understanding  in  a  comparative  evaluation. 


Mission  Casualties 

One  of  the  metrics  used  to  judge  whether  the  SUT  provides  better  situational 
awareness  was  a  decline  in  BLUFOR  casualties  during  tactical  battles.  Figure  5  shows 
casualty  data  per  mission  for  LUT  09. 
Figure 5. Number of Casualties per Mission (legend: BLUFOR, OPFOR, Civilian)

There  was  a  high  number  of  BLUFOR  casualties  in  almost  every  operation,  with  the 
greatest  number  of  casualties  occurring  primarily  in  attack  and  defend  missions.  For  the 
OPFOR,  high  losses  were  incurred  primarily  during  raids  with  Mission  1  being  the  most 
devastating.  Although  not  shown  on  this  figure,  it  is  interesting  to  note  that  despite  the 
potential  increase  in  situational  awareness  offered  by  the  SUT  system,  all  civilian  casualties 
were  caused  by  the  BLUFOR. 

In  order  to  assess  whether  there  was  a  plausible  likelihood  of  drawing  statistical 
conclusions  about  the  difference  in  casualties  sustained  by  the  SUT  unit  and  those 
sustained  by  the  baseline  unit,  we  used  the  casualty  rate.  This  metric  was  defined 
previously  as  the  following: 

$$\text{Unit Casualty Rate} = \frac{\text{Number of Losses Sustained by Unit}}{\text{Starting Strength of Unit}} \qquad (7)$$

Next, we used a one-directional Student's t-test at the 90% confidence level to
determine the minimum required mean casualty rate of the baseline unit to generate a
statistical difference. The formula we used for the Student's t-test statistic is shown below
(Sheskin, 2004):

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}\,\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \qquad (8)$$

where $\bar{X}_1$ is the mean casualty rate of the SUT unit, $\bar{X}_2$ is the mean casualty
rate of the baseline unit, $n_1 + n_2 - 2$ is the degrees of freedom, and $S_1$ and $S_2$ are the
standard deviations for the respective units. We solved for the baseline casualty rate given
an average SUT unit casualty rate of 10.1% and the number of baseline missions
conducted. Figure 6 displays the results. The mean casualty rate depicted in the figure is
the minimum casualty rate the baseline unit must sustain for significance. Below this
rate, the results of the comparative evaluation become inconclusive. The red dot is the
casualty rate of the SUT unit.
Figure 6. Minimum Required Mean BLUFOR Casualty Rate

The figure shows that for a large number of baseline missions conducted, for
example 20 missions, the minimum required casualty rate for statistical difference is 12.6%.
As the number of missions decreases to 13, the number conducted in LUT 09, the
minimum required casualty rate is 12.9%. Further decreases in the number of missions
conducted result in higher required casualty rates for the baseline BLUFOR.
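Rearranging the pooled t statistic for the baseline mean shows how such thresholds arise. The sketch below is illustrative only. The standard deviation `SD = 0.054` is a hypothetical placeholder, not a value reported in this paper, chosen so that the output lands in the range described above. Both units are assumed to share that standard deviation, and the one-tailed 10% critical t values are taken from standard tables rather than computed.

```python
from math import sqrt


def pooled_sd(s1: float, n1: int, s2: float, n2: int) -> float:
    """Pooled standard deviation of two samples."""
    return sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def min_baseline_mean(sut_mean: float, s1: float, n1: int,
                      s2: float, n2: int, t_crit: float) -> float:
    """Smallest baseline mean for which the magnitude of t reaches t_crit."""
    return sut_mean + t_crit * pooled_sd(s1, n1, s2, n2) * sqrt(1 / n1 + 1 / n2)


SUT_MEAN, N_SUT = 0.101, 13  # SUT casualty rate and LUT 09 mission count from the text
SD = 0.054                   # HYPOTHETICAL standard deviation, not taken from the paper

# One-tailed 10% critical values from a t table: df = 24 -> 1.318, df = 31 -> 1.309.
for n2, t_crit in ((13, 1.318), (20, 1.309)):
    rate = min_baseline_mean(SUT_MEAN, SD, N_SUT, SD, n2, t_crit)
    print(f"n2 = {n2}: minimum baseline casualty rate = {rate:.1%}")
```

The same rearrangement applies to any metric compared this way. For a metric where the baseline is expected to be lower than the SUT value, the margin is subtracted instead of added.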

We  conducted  a  similar  analysis  for  the  OPFOR,  the  results  of  which  are  shown  in 
Figure  7.  For  the  OPFOR,  the  two  units  being  compared  are  the  OPFOR  unit  against  an 
SUT  unit,  and  the  OPFOR  unit  against  a  baseline  unit.  In  contrast  to  the  BLUFOR,  the  mean 
casualty  rate  shown  in  Figure  7  is  the  maximum  casualty  rate  sustained  by  the  OPFOR  for  a 
given  number  of  missions.  The  red  dot  is  the  mean  casualty  rate  of  the  OPFOR  unit  against 
the  SUT  unit  and  has  a  value  of  22.3%. 
Figure 7. Maximum Required Mean OPFOR Casualty Rate


The  results  indicate  that  a  maximum  casualty  rate  of  14.5%  is  required  if  the  IOT&E 
conducts  20  or  more  baseline  missions.  Conducting  approximately  13  baseline  missions  will 
necessitate  a  maximum  casualty  rate  of  13.7%  for  a  statistical  difference.  As  the  number  of 
missions  decreases  below  13,  the  maximum  required  casualty  rate  falls  to  low  values  of 
12.0%  or  below.  That  is,  for  a  starting  strength  of  about  60,  the  OPFOR  unit  will  only  lose 
about  seven  soldiers. 

In  order  to  determine  whether  the  required  mean  OPFOR  and  BLUFOR  casualty 
rates  are  reasonable  values,  we  compared  these  values  to  rates  generally  observed  from 
tactical  battles  in  operational  tests.  We  selected  an  average  casualty  rate  of  approximately 
10% as our guideline based on discussions with analysts from the Institute for Defense
Analyses. The value of 10% was in the range of that exhibited by the SUT unit during the
LUT  09.  Using  the  10%  guideline,  we  surmised  that  achieving  a  required  mean  BLUFOR 
casualty  rate  of  12.9%  may  be  possible  during  IOT&E.  In  the  case  of  the  OPFOR,  traditional 
casualty  rates  have  been  on  the  order  of  25%  in  recent  operational  tests.  Therefore, 
achieving  a  casualty  rate  of  below  16%  may  prove  quite  challenging  during  IOT&E.  While 
there  is  a  possibility  of  obtaining  significant  results  for  the  BLUFOR  casualty  data,  the  low 
OPFOR  casualty  rate  required  makes  it  unlikely  that  the  current  test  set-up  will  yield 
statistically conclusive results in the case of the OPFOR. In other words, it is possible to
observe the SUT contribute to a reduction in BLUFOR casualties during IOT&E but highly
unlikely to observe a contribution to BLUFOR lethality.

Mission  Fratricides 

One  of  the  expected  effects  of  better  situational  awareness  is  reduced  BLUFOR 
fratricide.  Figure  8  shows  the  number  of  fratricide  incidents  per  mission  for  the  LUT  09. 
Figure 8. Number of Fratricide Incidents per Mission

In  Missions  2,  8,  10,  and  11,  the  BLUFOR  sustained  a  large  number  of  fratricides 
relative  to  the  other  missions.  These  missions  were  diverse  in  type  with  Mission  2  being  a 
raid,  Missions  8  and  10  being  cordon  and  search  missions,  and  Mission  11  being  an  attack 
mission.  As  such,  no  initial  conclusions  may  be  drawn  about  the  tendency  of  certain 
missions  to  produce  BLUFOR  fratricides.  The  OPFOR  sustained  high  fratricide  losses  in 
three  missions,  two  of  which  were  raid  missions. 

In  this  analysis,  we  were  concerned  primarily  with  BLUFOR  fratricides.  As  OPFOR 
does  not  possess  the  SUT,  OPFOR  will  not  have  enhanced  situational  awareness.  It  is 
expected  the  SUT  will  have  no  effect  on  the  OPFOR  fratricides.  We  used  the  fratricide  rate 
to  determine  the  limits  for  statistical  significance.  As  stated  previously,  the  fratricide  rate  is 
defined  as  the  following: 


$$\text{Unit Fratricide Rate} = \frac{\text{Number of Unit Fratricides}}{\text{Number of Unit Casualties}} \qquad (9)$$


Similar to the casualty rate, we implemented a one-directional Student's t-test at the
90% confidence level to determine the minimum required fratricide rate of the baseline unit
to generate a statistical difference. We solved for the baseline fratricide rate given an
average SUT fratricide rate of 14.6% and the number of baseline missions conducted.
Figure 9 displays the results. The fratricide rate depicted in the figure is the minimum
rate the baseline unit must exhibit for significance. Below this rate, the results of the
comparative evaluation become inconclusive. The red dot is the mean fratricide rate of the
SUT unit.
Figure 9. Minimum Required BLUFOR Fratricide Rate

The analysis suggests that for an extremely low number of missions completed (e.g.,
eight or below), the minimum required fratricide rate surpasses 26.0%. This high rate implies
that approximately a quarter of all casualties sustained by the BLUFOR would need to be
self-inflicted. As the number of missions conducted approaches 13, the number conducted in
LUT 09, the required fratricide rate falls below 26.0% to approximately 25.0%.
Conducting additional missions will have only a marginal effect on the minimum required
fratricide rate: at 21 missions conducted, the rate falls to 23.4%.

In  order  to  gain  an  idea  of  whether  these  required  fratricide  rates  are  reasonable,  we 
first  needed  some  idea  of  the  average  fratricide  rate  in  actual  tactical  operations.  Precise 
fratricide  data  is  relatively  difficult  to  obtain.  However,  available  reports  placed  the  fratricide 
rates  around  13%  (Bower,  Lacey,  &  McCarthy,  2003;  Gadsden  &  Outteridge,  2006).  Using 
this  figure  as  a  guideline  we  determined  the  plausibility  of  drawing  any  statistical  conclusions 
from  the  comparative  tests.  The  minimum  mean  required  fratricide  rate  shown  in  Figure  9  is 
almost  100%  higher  than  that  experienced  during  actual  tactical  operational  conditions. 
Given  this  high  required  BLUFOR  fratricide  rate,  it  is  unlikely  that  the  current  test  set-up  will 
yield  statistically  conclusive  results. 

One additional point is worth noting regarding fratricides. The exponential-like
shape of the curve in Figure 9 suggests there are diminishing returns to conducting an
increasing number of missions. If we extended the analysis to 100 conducted missions, the
minimum required fratricide rate would remain relatively high at 21.0%. Thus, the statistical
returns from conducting a large number of missions may not justify the incremental test
costs.
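One way to see where the diminishing returns come from: in a pooled two-sample t statistic, the number of baseline missions enters mainly through the factor sqrt(1/n1 + 1/n2), which cannot fall below sqrt(1/n1) while the SUT sample stays fixed. The sketch below assumes n1 = 13 SUT missions and equal standard deviations for both units, and it ignores the small drift in the critical t value as degrees of freedom grow; the names are illustrative.

```python
from math import sqrt

N_SUT = 13  # SUT missions conducted in LUT 09


def se_factor(n_baseline: int, n_sut: int = N_SUT) -> float:
    """Sample-size factor that scales the required gap between the two means."""
    return sqrt(1 / n_sut + 1 / n_baseline)


floor = sqrt(1 / N_SUT)  # limit as the number of baseline missions grows without bound

for n2 in (13, 21, 100, 1000):
    relative = se_factor(n2) / se_factor(13)
    print(f"n2 = {n2:4d}: factor = {se_factor(n2):.3f} ({relative:.0%} of the 13-mission value)")
print(f"floor: factor = {floor:.3f} ({floor / se_factor(13):.0%} of the 13-mission value)")
```

Under these assumptions, going from 13 to 100 baseline missions shrinks the required gap by only about a quarter, and no number of baseline missions can shrink it by more than about 29%.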

Sensitivity  Analysis 

The  analysis  performed  in  the  previous  sections  is  predicated  on  a  number  of 
assumptions.  Among  these  assumptions  are  (1)  the  variability  in  performance  measures 
across missions is identical for both the SUT unit and the baseline unit, (2) a 90% confidence
level is the appropriate confidence level for the statistical analysis, and (3) the
performance  of  the  SUT  unit  in  the  LUT  09  is  representative  of  its  future  performance  in 
subsequent  IOT&E.  For  the  sensitivity  analysis,  we  modified  each  of  these  assumptions  and 
examined  the  corresponding  response  of  the  required  unit  mean  performance  limits  for 
statistical  significance. 
One key assumption in the previous analysis was that the variability exhibited in
several  of  the  SUT  performance  measures  may  be  used  as  a  proxy  for  the  variability  of  the 
baseline  unit  performance.  We  believed  this  assumption  to  be  justifiable  as  this  has  been 
the  case  in  many  prior  side-by-side  tests.  In  order  to  test  the  robustness  of  our  conclusions 
to  changes  in  variability,  we  relaxed  our  assumption  by  assuming  the  performance  variability 
of  the  baseline  unit  is  half  that  of  the  SUT  unit. 

Confidence  levels  are  often  used  to  establish  bounds  on  performance  metrics  in  the 
presence of uncertainty. While analysis is traditionally conducted at standard confidence
levels (90%, 95%, or 99%), the criteria for selecting confidence levels are arbitrary. To
understand the sensitivity of our previous analysis to changes in the confidence level, we
reduced the confidence level to 80%.

Finally,  survey  results  in  the  LUT  09  indicated  that  the  unit  claimed  the  SUT  was  not 
instrumental  in  accomplishing  missions.  Based  on  this  response,  we  reexamined  the  data 
assuming  the  values  obtained  in  the  LUT  09  represent  the  baseline  unit  instead  of  the  SUT 
unit.  Next,  we  evaluated  how  much  improvement  the  new  system  must  provide  in  order  to 
be  statistically  different. 

Table  4  shows  the  results  for  the  sensitivity  analysis  for  the  13  conducted  missions  in 
LUT  09.  The  initial  results  are  those  obtained  in  the  previous  sections  (i.e.,  initial  results 
assumed  the  LUT  09  was  representative  of  the  SUT  and  described  how  well  or  poorly  a 
baseline  must  perform  to  be  different  from  the  SUT). 

Table 4. Results of Sensitivity Analysis

The last four columns give the required values for statistical significance in IOT&E.

| Metrics | Observed in LUT 09 | Initial Results | 50% Variability Reduction | Confidence Level = 80% | SUT |
|---|---|---|---|---|---|
| Missions Not Accomplished | 1 | 4-6 | N/A | 4 | - |
| Mission Success Rate | 0.85 | 63.2% | N/A | 71.1% | 98.2% |
| BLUFOR Casualty Rate | 10.1% | 12.9% | 12.1% | 11.9% | 4.7% |
| OPFOR Casualty Rate | 22.3% | 13.7% | 16.2% | 16.7% | 31.0% |
| BLUFOR Fratricide Rate | 14.6% | 24.5% | 21.6% | 21.2% | 7.3% |

For the mission success metrics (mission losses and success rate), only two of the
three scenarios applied. Relaxing the confidence level to 80% saw the number of mission
losses required of the baseline unit fall by one or two, to four (for both cases, $X_{LW} = 0$
and $X_{LW} = 1$), or around 30.8% of all missions conducted. The mission success rate of the
baseline unit increased to 71.1% from 63.2%. Interestingly, reducing the variability by 50%
or the confidence level to 80% had almost identical impacts on the casualty and
fratricide rates. In both cases, the BLUFOR casualty rate declined to just over 11.5%, the
OPFOR casualty rate rose to just over 16.0%, and the BLUFOR fratricide rate fell to around
21.0%.
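The effect of relaxing the confidence level can be illustrated by recomputing the two thresholds it controls: the normal critical value used in the z-test and the significance cut-off used in the sign test. This sketch assumes one-tailed tests throughout and a single SUT mission loss, as in LUT 09; the function names are illustrative and not from the original analysis.

```python
from math import comb
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """One-tailed standard normal critical value for a given confidence level."""
    return NormalDist().inv_cdf(confidence)


def losses_needed(x_lw: int, alpha: float, sut_losses: int = 1) -> int:
    """Baseline losses at which the sign-test probability first drops below alpha."""
    x_wl = 0
    while True:
        n = x_wl + x_lw
        tail = sum(comb(n, k) for k in range(x_wl, n + 1)) / 2 ** n
        if tail < alpha:
            # SUT losses not offset by a baseline win are missions both units lost.
            return x_wl + (sut_losses - x_lw)
        x_wl += 1


for confidence in (0.90, 0.80):
    alpha = round(1 - confidence, 10)
    print(f"{confidence:.0%}: z = {z_critical(confidence):.3f}, "
          f"losses (X_LW = 1) = {losses_needed(1, alpha)}, "
          f"losses (X_LW = 0) = {losses_needed(0, alpha)}")
```

At 80% confidence both sign-test cases require four baseline losses, and the critical z value drops from about 1.28 to about 0.84, which is what loosens the required rates.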


There are a number of insights that we can draw from the sensitivity analysis. First, it
is unlikely that the comparative test at the IOT&E will lead to any statistically relevant
conclusion regarding the BLUFOR fratricide rate. Under relaxed assumptions, the fratricide
rate needed to show a statistical difference remained high, above what is currently observed
in combat (Bower, Lacey, & McCarthy, 2003; Gadsden & Outteridge, 2006). Second,
reducing the variability by half and shifting the confidence level from 90% to 80% provide
results that are more consistent with the hoped-for operational performance of the baseline
unit in the cases of the mission success and BLUFOR casualty measures. The OPFOR
casualty measure remained significantly lower than that normally observed in actual combat.
Judging from the new mission success and BLUFOR casualty rates, one might infer that there
may be some opportunity to produce statistically conclusive results for the mission success
and BLUFOR casualty metrics.

The  sensitivity  analysis  provided  a  potentially  positive  outlook  for  gaining  conclusive 
information  about  the  ability  of  the  SUT  to  enhance  situational  awareness  evident  through 
reduced  casualties  and  improved  mission  success.  The  potential  outlook  may  support  the 
argument  for  conducting  a  comparative  test  as  planned.  However,  it  is  important  to 
understand  the  disadvantage  of  relaxing  these  two  assumptions.  Operational  tests  are  often 
complex  with  a  great  degree  of  variability  in  test  parameters.  While  relaxing  the  assumptions 
produces  plausible  results,  there  is  a  lower  degree  of  confidence  associated  with  the  derived 
conclusions.  Rephrased  in  a  more  colloquial  manner,  increasing  the  likelihood  of  drawing 
conclusions  reduces  the  confidence  in  those  conclusions. 

Recall that in the third set of sensitivity analyses, we assumed that data gathered from
the LUT 09 were representative of the baseline unit as opposed to the SUT unit. This was
done because units noted that the SUT did not help in their missions. If we assume that the
performance exhibited in LUT 09 forms a baseline, the mission success rate indicates that
the SUT unit would effectively need to win all of its missions in order to produce
statistically significant conclusions. However, a review of the metric and the number of
mission losses by the SUT unit suggests that it is not possible to obtain statistically
significant results even with a 100% mission success rate. The minimum required mean SUT
BLUFOR casualty rate was extremely low at 4.7%, and the maximum required SUT fratricide
rate to yield statistical significance was 7.3%. These rates imply that, for a unit with a
starting strength of 140, only approximately seven casualties and zero fratricides would
occur.
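A short calculation illustrates both points. It assumes the sign test counts only missions in which exactly one unit succeeds, so a baseline with a single loss can contribute at most one such mission even if the SUT unit wins everything. The variable names are illustrative.

```python
# LUT 09 treated as the baseline: one mission not accomplished.
baseline_losses = 1

# Best case for the SUT unit: it wins every mission, so the only discordant
# mission is the one the baseline lost. With every discordant mission favoring
# the SUT, the sign-test probability is 0.5 ** (number of discordant missions).
best_case_p = 0.5 ** baseline_losses

# Smallest number of discordant missions that would push that probability below 0.1.
pairs_needed = 1
while 0.5 ** pairs_needed >= 0.1:
    pairs_needed += 1

# Head counts implied by the required rates for a unit of 140.
starting_strength = 140
casualties = 0.047 * starting_strength  # required casualty rate of 4.7%
fratricides = 0.073 * casualties        # required fratricide rate of 7.3% of casualties

print(f"best-case sign-test probability: {best_case_p}")
print(f"discordant missions needed for significance: {pairs_needed}")
print(f"casualties: {casualties:.1f}, fratricides: {fratricides:.1f}")
```

Under this reading, the best attainable probability is 0.5, well above 0.1, and at least four discordant missions would be needed. The required rates correspond to roughly seven casualties and fewer than one fratricide.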

The conflicting inferences drawn about the SUT unit's needed improvement in regard
to the mission success measures raised questions about whether it is possible to draw
statistical conclusions given the current set-up of the side-by-side test. The required mean
casualty and fratricide rates appear to be optimistic. At this point, we reserved any judgment
about the possibility of the SUT unit achieving such rates. It is possible that there will be
significant performance improvements in the SUT system during subsequent testing.

Conclusion 

The  objective  of  this  analysis  was  to  demonstrate  the  use  of  statistical  inference  to 
better  understand  the  potential  of  an  SUT  in  improving  the  operational  performance  of  a 
given  unit  prior  to  conducting  a  comparative  test.  Specifically,  we  wanted  to  establish 
expectations  for  the  statistical  outcome  of  a  comparative  test  by  examining  whether  the 
behavior  of  the  means  and  standard  deviations  of  the  metrics  can  reasonably  be  expected 
to  generate  differences.  Initial  results  indicated  that  while  the  comparative  test  set-up  for  the 
SUT  may  yield  statistically  significant  results  for  one  of  the  system  evaluation  metrics,  it  is 
highly  unlikely  that  evaluators  will  observe  statistical  significance  in  the  remaining  metrics. 

One factor often cited for the lack of observed statistical significance is large
variability in performance metrics due to the complexity of operational tests. While there is
some justification for this statement, there are other factors that may be adjusted in the
test structuring to yield a more informative test outcome. By dissecting the underlying
factors that drive each metric, we pointed to potential improvements for test structuring
that may enhance the likelihood of observing differences in a greater percentage of the
metrics. Most notable was a recommendation to reconsider the BLUFOR to OPFOR starting
strength ratio. While this ratio may be representative of current field operations, it
somewhat inhibits understanding of the potential benefits of the SUT to unit performance.

Finally,  the  metrics  presented  in  this  analysis  provide  a  glimpse  into  the  top  level 
performance  of  a  unit  with  the  SUT.  However,  it  is  important  to  note  that  the  analysis  did  not 
consider  qualitative  measures  of  operational  effectiveness.  Information  for  qualitative 
assessments  is  gathered  through  surveys  and  structured  interviews,  and  it  may  provide 
added  insights  not  immediately  evident  in  the  quantitative  metrics  used. 
Required casualty rates and mission success metrics are consistent with observed values

Required improved performance of SUT raised concerns about being able to provide
conclusive results in a comparative test

Metrics

Required fratricide rate remains high

Required OPFOR casualty rate remains low

Summary 


■ Using statistical inference, insight may be gained about possible outcomes of
comparative tests

>  Guide  expectations 

>  Point  to  areas  where  test  may  need  restructuring 

>  Enable  a  more  effective  utilization  of  resources 

■  For  case  study,  it  is  likely  that  a  comparative  evaluation  of  these  quantitative 
metrics  will  lead  to  statistically  inconclusive  results  as  performance 
requirements  are  high 

>  Possible  restructuring  of  test  needed 

>  Given  current  performance  of  SUT,  a  comparative  test  may  not  be  an  effective 
utilization  of  limited  resources 

■  Extend  analysis  to  qualitative  measures  of  operational  effectiveness  which 
are  gathered  from  surveys  and  interviews 